-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix warnings in test_csv.py. #10362
Fix warnings in test_csv.py. #10362
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10362 +/- ##
================================================
+ Coverage 10.42% 10.58% +0.15%
================================================
Files 119 125 +6
Lines 20603 21058 +455
================================================
+ Hits 2148 2228 +80
- Misses 18455 18830 +375
Continue to review full report at Codecov.
|
@shwina @galipremsagar @vuule what do you think about @bdice's question here? Should we be changing cuIO's Python bindings here to match pandas? For now I think silencing the pandas warning as we do here is fine, but we should add a TODO comment or something indicating how we want to handle this when pandas starts throwing the error. |
IIRC, we generally don't want to promise overflow checking in libcudf for performance reasons. So I'm fine with not following Pandas behavior here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a useful warning, as the test will be broken once Pandas behavior changes.
@bdice could you instead move the overflowing value to a separate test (that does not use Pandas)?
I read the conversation on #1925 again, and I now understand that the design intended to use signed values when parsing. That clarifies the intended behavior, so it's just a matter of separating this test as @vuule described. This design decision surprises me a bit (I would have expected |
6ddf6e8
to
e015b23
Compare
I added some tests that compare with NumPy and expand the tested range of overflow values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥
def test_csv_reader_hexadecimal_overflow(np_dtype, gdf_dtype): | ||
# This tests values which cause an overflow warning that will become an | ||
# error in pandas. NumPy wraps the overflow silently up to the bounds of a | ||
# signed int64. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it always 64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. By default, numpy treats types larger than int64/uint64 with object
as the dtype and uses a Python int
to back it. There are larger types, like np.int128
, but they're not used by default. The wider the type, the less-wide the support, I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction: I was thinking of floating types, not integral types. There are 128-bit float types (depending on the platform/build of NumPy) like np.longdouble
but 128-bit integers are not in the NumPy API. However, 128-bit integers seem to be used internally and references can be found in a few places in the source.
References:
- ENH: int128, uint128 support? numpy/numpy#9992
- https://github.com/numpy/numpy/search?q=int128&type=code
>>> np.finfo(np.longdouble)
finfo(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)
@gpucibot merge |
This PR silences warnings in
test_csv.py
. (I am working through one test file at a time so we can enable-Werr
in the future.)The only warning in this file is related to integer overflow in pandas. Currently, the test data is as follows:
cudf/python/cudf/cudf/tests/test_csv.py
Lines 1313 to 1319 in 21325e8
First, I note that this "hex" dtype is not part of the pandas API. It is a cuDF addition (#1925, #2149).
Note that there are dtypes for
int32
/hex32
, and the test data contains both a negative value-0x1000
and a value9512c20b
. The negative value-0x1000
has a sensible interpretation if the results are meant to be signed, but then the value9512c20b
is out of range (the maximum signed 32-bit value would be0x7FFFFFFF
and the minimum signed 32-bit value would be0x80000000
, using the big-endian convention of the parser). Recognizing this, pandas throws aFutureWarning
when parsing the data9512c20b
asint32
, and unsafely wraps it to a negative value. This behavior will eventually be replaced by anOverflowError
.In the future, we may need to decide if cuDF should raise an
OverflowError
when exceeding0x7FFFFFFF
for consistency with pandas, or decide to use unsigned integers when parsing "hex" dtypes and compare to pandas' unsigned types in this test.